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Abstract. We describe a cache-friendly version of van der Hoeven's truncated 
FFT and inverse truncated FFT, focusing on the case of 'large' coefficients, 
such as those arising in the Schonhage-Strassen algorithm for multiplication 
in Z[x]. We describe two implementations and examine their performance. 



1. Introduction 

In typical implementations of the FFT method for dense univariate polyno- 
mial multiplication, the input polynomials are zero-padded up to an appropriate 
power-of-two length, causing a jump in the running time when the lengths cross 
a power-of-two boundary. Van der Hoeven recently described a multiplication al- 
gorithm that greatly reduces the size of these jumps, by introducing a novel TFT 
(truncated FFT) and ITFT (inverse truncated FFT), achieving relatively smooth 
performance without sacrificing the simplicity of a power-of-two transform length 
|vdH04ll?dH05] . 

However, the transforms that he describes suffer from suboptimal locality. The 
transforms follow the divide-and-conquer FFT paradigm, recursively splitting the 
problem into two half-sized transforms. If the transform length is 2^, and only 2'' 
coefficients fit into a given level of cache, then only the deepest k layers of the 
transform take advantage of that cache; the remaining £ — k layers do not. 

In this paper we address this difficulty, achieving superior locality by reordering 
the sequence of butterfly operations in van der Hoeven's transforms. Our strategy 
is similar to Bailey's algorithm |Bai90| . Bailey rearranges the data into a 2^^ x 2^^ 
matrix, where ^1+^2 = £, and then rewrites the transform as 2^^ column transforms 
of length 2^1 followed by 2^^ row transforms of length 2^^ . The divide-and-conquer 
algorithm may be regarded as the special case where £1 = 1 and £2 = £ — I. 
However, when £i « £/2, the working set for each row and column is only about 
2^/^ coefficients, greatly improving the algorithm's locality. This method can of 
course be applied recursively, until the working set for each subtransform fits into 
the lowest level of cache, making efficient use of the entire memory hierarchy. 

It is straightforward to adapt this idea to the TFT, obtaining a decomposition 
of the TFT into TFTs of half the depth (^. The corresponding decomposition of 
the ITFT is more involved; it becomes necessary to alternate between ITFTs on 
the rows and columns in a slightly complicated way (Q. 

In ^|5]we discuss the performance of two implementations. The first is an imple- 
mentation of the Schonhage-Strassen algorithm jSS71j for multiplication in Z[x]. 
The second is an implementation of the Schonhage-Nussbaumer convolution al- 
gorithm |Sch77|, lNus80| for the case of {Z/mZ)[x] where m is an odd word-sized 
modulus. In both cases the Fourier coefficients occupy relatively large blocks of 
memory. A natural question is whether the new algorithms are suitable for the 
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more conventional case of 'small' coefficients, such as double-precision real or com- 
plex coefficients. We offer some speculation in S}6] although we have not attempted 
an implementation. 

2. Notation and setup 

Let i? be a commutative ring in which 2 is invertible. We assume that R contains 
a principal M-th root of unity lo, where M = 2'" for some integer m > 1; this means 
that uj^'-' — 1 and moreover that X^i^f^o ^ — ^ ^^r all < j < M. We have in mind 
examples like R — Z/(2^^/-^ -I- 1)Z and uj ~ 2, which appears in the Schonhage- 
Strassen algorithm for multiplication in Z[x]. 

If L I M, we denote by lu^ the principal i-th root of unity uj^-^^^; we then have 
the compatibility relation {lol')^ = '^l foi' any L \ L' \ M. 

Now suppose that L | M, L = 2^ and let C G -R"" • Let (ag, . . . , aL-i) € R^ ■ The 
(weighted) discrete Fourier transform (DFT) is defined by 

L-l 

(1) a, = C^'^a;fa„ < j < L, 

1=0 

where f denotes the length-^ bit-reversal of j. 

We define the truncated Fourier transform (TFT) as follows. Let 1 < z < L and 
1 < n < L, and suppose that = • • • = a^-i = 0. Then 

TFT(L, C, z, n; (oq, . . . , a^^i)) (ao, . . . , a„_i). 

In other words, the TFT computes a prescribed initial segment of the transform, 
assuming that some prescribed final segment of the untransformed data is zero (see 
Figure [T]) . 
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Figure 1. The TFT. 

The definition of the inverse truncated Fourier transform (ITFT) is more in- 
volved. Let / G {0, 1}. Suppose that 1 < z < L and I < n + f < L, and moreover 
that z > n. Suppose as before that az = ■ ■ ■ = a^-i = 0. Then 

ITFT(L, C, z, n, /; (ao, . . . , a„_i, La„, . . . , La^^i)) 

_ I {Lao, ■ ■ ■,Lan^i) / = 0, 
|(Lao, . . . ,La„_i,a„) / = 1. 

In other words, the ITFT takes as input an initial segment of the transformed data 
together with the complementary final segment of the untransformed data (some 
components of which are known to be zero) , and returns the initial segment of the 
untransformed data, and optionally (if / = 1) the next transformed coordinate 
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(see Figure |2|. When 2 = n = i, / = and C = 1, the TFT and ITFT reduce 
to the usual DFT and inverse DFT. with inputs in normal order and outputs in 
bit-reversed order. 
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Figure 2. The ITFT. 

It is not obvious a priori that the ITFT is well-defined, and in particular that the 
coordinates ao, . . . , an^i,an, ■ ■ ■ , ol-i are linearly independent. Van der Hoeven 
deduced this from the correctness of his algorithm for computing the ITFT; it will 
follow in the same way from the proof of correctness of our cache-friendly ITFT 
algorithm in f|4] 

Van der Hoeven allowed the input and output coordinates to come from a wider 
class of subsets of {0, . . . , L — 1}. In this paper we restrict ourselves to the initial 
and final segments mentioned above, which suffices for our intended application to 
univariate polynomial multiplication. 

The TFT and ITFT may be used to deduce a polynomial multiplication al- 
gorithm in R[X] as follows. Suppose that g,h ^ R[X], and let u = gh. Let 
zi = I + degg, Z2 = 1 + deg/i, n — zi + Z2 — I, and assume that n < L. Let 
50i---iffzi-i be the coefficients of g and Hq, . . . ^h^^-i be the coefficients of h. 
Compute 

(50, ■ ■ • = TFT(L, l,zi,n; (50,- ■ -,5^1-1)), 

{ho, hn-i) = TFT(L, 1, 23, n; (/iq, . . . , /iz^-i)), 

and then compute — cjihi in R for Q < i < n. Then uq, . . . , Un-i are the first n 
Fourier coefficients of m, and moreover u„ = • • • = = since n — degu + 1. 

Therefore we recover u via 

(iuo, . . . , Lun-i) = ITFT(L, 1, n, n, 0; (uo, . . . , u„_i)). 

(This multiplication algorithm has not used the parameters / or <^ in a nontrivial 
way; these enter the picture when the algorithms are called recursively in fjS] and 

The standard FFT algorithms compute the DFT (or inverse DFT) using lL/2 
'butterfly operations'. In contrast, van der Hoeven showed that the TFT and 
ITFT may be computed using at most ln/2 + L butterfly operations, and we will 
see that this estimate holds for our cache-friendly TFT and ITFT algorithms as 
well. Furthermore, in the multiplication algorithm sketched above, only n pointwise 
multiplications are performed, compared to the L multiplications incurred by the 
standard FFT method. Therefore, in this simplified algebraic complexity model, 
the ratio of the running time of the TFT/ITFT-based multiplication algorithm 
to the running time of the usual FFT multiplication algorithm is n/L + 0{£^^), 
indicating that the performance is relatively smooth as a function of n. 
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Algorithms |0] and [O] below (CacheFriendlyTFT and CacheFriendlyITFT) 
implement the TFT and ITFT in a cache-friendly manner. They operate on an 
array xq, . . . , xl-i, where L = 2^. In general all L elements of the array, even those 
elements not containing input or output, are used in intermediate computations. 

For the TFT, the first z elements are expected to contain the inputs ag, . . . , az-i, 
and the outputs oq, ■ • ■ , On-i are written in-place to the same array. For the ITFT, 
the first z elements are expected to contain the inputs 

and the outputs Lgq, . . . ,La„_i (optionally followed by a„ if / = 1) are written 
in-place to the same array. 

Both algorithms make use of the following well-known decomposition of ([T]) . Let 
L = L1L2 where Li = 2'^^ and L2 = 2^^ (go that ^1 + £2 = Write i ^ i2 + ^2*1 
where < ii < Li and < i2 < L2, and similarly for j. Then j' = j[ + -L1J2, 
where and j'2 are respectively the length-^i and length-^2 bit-reversals of ji and 
j2- We obtain 

L,-l Li -1 

<^J2+L2h - ^ 2^ 2^ ai2+L2ii 




L,-l 



^LT^^l2+L2il 



Therefore if we put 

Li-l 

(2) b, = bu,+L2k, = (C^^)'^^ E C'^'^^+L 



2™' 

m=0 



we obtain 



L2-1 

(3) a, = (C^O^'^ E '^SW2.x- 

r-O 

In other words, if a, b and a are thought of as Li x L2 matrices, then 6 is the result 
of applying an appropriately weighted DFT to each of the columns of a, and a is 
the result of applying an appropriately weighted DFT to each of the rows of h. 

For the base case L = 2 the routines compute the TFT/ITFT directly. If i = 
2^ > 4, they write L = L1L2 where Li = 2L^/2J j^^^j 2.2 = 2^'''/'^\ so that 1 < 
Li < L and 1 < L2 < L. They treat the array as an Li x L2 matrix, and recurse 
into TFTs/ITFTs on the columns and rows. The column transforms correspond 
to recursively applying the TFT/ITFT to the transform given by ([2]); the row 
transforms similarly correspond to the transform given by ([s]). (Van der Hoeven's 
TFT and ITFT algorithms are essentially the special case obtained by taking Li — 2 
and L2 = L/2.) 

We will denote by the u-th column {x^, x^+i^^, . . . , x^j^(^]^^_i^]^^) and by r„ the 
w-th row {xuL2-iXuL2+i^ • ■ ■ 5 XuL2+L2-i)- A real implementation would use auxiliary 
variables to describe such sub-arrays; for example, a pointer to the first element 
and a stride parameter. 

Common to both routines is the decomposition n — n2+L2ni where < rii < Li 
and < n2 < L2, and where ni = Li implies 712 = 0. This partitions the first n 
cells of the array into rii complete rows followed by 712 cells in the subsequent row. 
The parameter z is decomposed similarly into Zi and Z2- 
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3. A CACHE-FRIENDLY TFT 

We first consider the TFT; the idea is to compute only those parts of the DFT 
that are requested. We handle the column transforms first, followed by the row 
transforms. 

Algorithm 1: CacheFriendlyTFT(L, z, n; (xq, . . . , Zi^i)) 

Input: L = 2^ > 2, C e i?"", 

l<z<L, l<7i<L, 

Xi = ai for < i < z 
Output: Xi = hi for < « < n 

1 ii L ~ 2 then 

// base case 

2 if ri = 2 and z ~ 2 then (a;o, xi) ^ {xq + xi, Ci^o — xi)) 

3 if n = 2 and z = 1 then xi ^ (xq 

4 if n = 1 and z = 2 then 2:0^x0 + xi 

5 return 

6 end 

// recursive case 

7 Li ^2L^/2J, L2 ^2^^/21 

8 n2 ^ n mod L2, ni ^ ["/^aj , n[ ^ ["/^a] 

9 Z2 ^ z mod ^2, zi ^ [z/i2j 

10 if zi > then Z2 ^ ^2 else Zj ^ Z2 

// column trciasforms 

11 for < u < Z2 do CacheFriendlyTFT(Li, w^^C, zi -t- 1, ; Cu) 

12 for Z2 < M < Z2 do CacheFriendlyTFT(Li, zi, ri'^; c„) 

// row transforms 

13 for < u < ni do CacheFriendlyTFT(L2, ("^S 4j -^2; 

14 if ^2 > then CacheFriendlyTFT(L2, C'^S 4^ "2; J^m ) 



Theorem 1. Algorithm^ correctly computes the TFT. The base case is executed 
at most min((n — 1)^/2 + L — 1, Li/2) times. 

Proof. We first consider the base case L = 2. The relevant DFT is given by 
(ao,ai) = (ao + ai,C(ao — oi))- If z = 1 then oi — 0, and the transform be- 
comes simply (ao, Si) — (oq, Cio)- H n = 2 then both ao and ai must be computed; 
if n = 1 then only ag is needed. Lines [2]-|4] handle the various cases. 

Now we consider the recursive case, for L — 2^ > i. Figures |3ja)-(c) show the 
possible input configurations, for L = 64, Li = L2 = 8. Cells labelled a contain 
some at] cells labelled • contain uninitialised data, but implicitly represent ai = 0. 
Diagram (a) shows the case zi = 0, in which case Z2 = Z2. Diagram (b) shows the 
case Zi > and Z2 = 0, and diagram (c) shows the case zi > 0, Z2 > 0. In these 



latter cases Z2 = L2. Lines 11 12 apply the TFT recursively to the columns to 



evaluate the first n'^ rows of |2r. Line 11 handles those columns containing zi + 1 



nonzero entries; line |12| handles those containing only zi nonzero entries. 
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Figure 3. Before line[lT]of CacheFriendlyTFT. 
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Figure 4. After line O of CacheFriendlyTFT. 
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Figure 5. After lines nMS of CacheFriendlyTFT. 



After lines 11 ■ 12 have executed, we have Xi = hi for < zi < n'^ and < 12 < 4' 



and we also know that hi — Q for Z2 < i < (the latter statement is non- vacuous 
only if Zl = 0). Figure |4] illustrates the situation: cells labelled h contain some hi, 
cells labelled • contain unspecified data but implicitly represent 5^ = 0; cells labelled 
? are meaningless. Diagram (a) shows the case z'2 < L2, and diagram (b) shows 
z'2 ~ ^2- 

apply the TFT recursively to the first n\ rows to evaluate (Isl 



Next, lines 13 



14 



Figure [5] shows the possible output configurations. Cells labelled a contain some 
di] cells labelled ? contain meaningless data. Diagram (a) shows the case 71-2 > 0, 
where n'^ = ni + 1, and diagram (b) shows the case 712 = 0, where n'l = jii. Line 
[T3] handles the first ni rows, where hi must be computed for Q < 12 < L2; line [T4| 
handles the remaining partial row, where is needed only for < ^2 < ^2- 
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We prove the complexity estimate by induction on L. For L = 2 the bound is 
min((n — l)/2 + 1, 1) = 1, so the estimate holds. Now assume that L > 4, and let 
£i = log2 Li and £2 = loga L2. 

We first verify that the number of calls to the base case is bounded by L£/2. By 
induction, lines [TT 12 call the base case at most L2{Li£i/2) times, and lines 13 -[TT 



call it at most n'i(L2^2/2) < Li(L2£2/2) times. The sum is LiL2(^i+4)/2 = L£/2. 

Second, we must verify that the number of calls is bounded by {n~\)£/2 + L — 1. 
Let 5 — n'^ — TLi e {0, 1}. Lines 11-[T2 call the base case at most L2{{ni+5 — \)£i/2- 



Li — 1) times, line 13 calls it at most 711(^2^2/2) times, and line 14 calls it at most 



5{{n2 — 1)^2/2 + L2 — 1) times. The sum of these terms is \X + Y where 

X ^ L2{ni - 1)4 + niL2£2 + S{L2£i + K - 1)4) 

= (n - n2)£ - L2£i + 5{L2£i + K - 1)4) 

= (n - l)£ +{5- 1)L24 + ("2 - l){5£2 - £), 
Y - £2(^1 - 1) + 5(^2 - 1) = i - 1 + (5 - 1)(L2 - 1). 

If (5= 1, then n2 > 1 and {n2 ~ 1){S£2 - £) = -^K-l) < 0. If 6 ^ then ^2 = 
and {6 — 1)^2^1 + {n2 — 1)((5^2 — ^) = —^2^1 + £1 + £2, which is non-positive since 
1/2 = 2^^ > £2 + The desired estimate holds in both cases. □ 

4. A CACHE-FRIENDLY INVERSE TFT 

The ITFT cannot be implemented by simply running the TFT in reverse, because 
when the ITFT commences there is insufficient information to perform all the row 
transforms. In particular, if n ^ mod L2, then the [n/i2j-th row contains some 
di but does not contain the corresponding bi needed to apply 

To circumvent this difficulty, we proceed as follows. We first perform as many 
row transforms as possible. We are then able to perform some of the column 
transforms. When these are complete, it becomes possible to execute the last row 
transform that was inaccessible before. After this row transform, the remainder of 
the column transforms may be completed. Algorithm [O] gives a precise statement. 

Theorem 2. Algorithm^ correctly computes the ITFT. The base case is executed 
at most min((n + / - 1)^/2 + L - l,L£/2) times. 

Proof. We first consider the base case L = 2. As before, the relevant DFT is given 
by (So, ai) = (oq + ai, C(ao — ai))- If n = 2, then we must have z = 2 and / = 0, 
and we are computing the map (do, Si) i— > (2ao,2ai) = (ao + C^^aijSo — C^^ai). 
This is handled by line|2] Now suppose that n — 1. If / = 1 and z = 2, we must 
compute the map (ao,2ai) 1-^ (2ao,ai) — (2ao — 2ai, C(ao — 2ai)) (van der Hoeven's 
'cross butterfly'). This is handled by line [3] Lines |4j|6] handle the analogous cases 
where / = (the second output is not needed) or where z = 1 (oi is assumed to 
be zero). Finally suppose that n = 0. Then we must have f — 1. If z = 2, we must 
compute (2ao,2ai) 1-^ aa — (2ao + 2ai)/2. This is handled by line [7] The z — 1 
case (where we assume oi = 0) is handled by line[8] 

We now suppose that L > 4 and consider the four cases below. Figures [6 10 



illustrate the various stages of the algorithm for each of these cases. Cells labelled 
a, b and d indicate respectively Lai, ^2^1 or df, cells labelled • are uninitialised, 
but implicitly represent = 0; cells containing ? contain unspecified data not used 
in subsequent computations. A symbol in parentheses indicates that the symbol is 
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Algorithm 2: CacheFriendlyITFT(L, C, z, n, /; {xq, . . . ,xl-i)) 

Input: L^2'^ >2, C e R"", 

/ e {0, 1}, 1 < n + / < L, 1 < z < L, z > n, 

Xi ~ Obi for < i < n, = La^ for n <i < z 
Output: Xi — Lai for < z < n, 
a;„ = a„ if / = 1 

1 if i = 2 then 

// base case 

2 if ri = 2 then (xq, xi) (xq + ('^"'"Xi, xo — C^"'"2;i) 

3 if n = 1 and / = 1 and z — 2 then (xq, Xi) ^ (2xo — Xi, C(a:^o ~ 2:^1)) 

4 if n = 1 and / = 1 and z = 1 then (xq, xi) ^ (2xo, C^^o) 

5 if n = 1 and / = and z = 2 then xq <— 2xo — x\ 

6 if n = 1 and / = and z 1 then xq ^ 2xo 

7 if n = and z = 2 then xq ^ (xq + xi)/2 

8 if n = and z —\ then xq ^ xo/2 

9 return 

10 end 

// recursive case 

11 Li <-2L^/2J, L2 ^2^^/21 

12 <— n mod L2, ni <— [n/L2j 

13 Z2 ^ z mod L2, Zi ^ [z/i2j 

14 if ^2 + / > then /' ^ 1 else /' ^ 

15 if zi > then Zj ^ ^2 else Z2 ^ Z2 

16 m «— niin(n2,Z2), m' niax(n2,Z2) 

// row tranforms 

17 for < U < ni do CACHEFRIENDLYlTFT(L2,C^S-^2,i2,0;r„) 

// rightmost column trcinsforms 

18 for n2 < u < ni' do CacheFriendlyITFT(Li, a;]^^, zi + l,ni,/';c„) 

19 for m' < u < Z2 do CacheFriendlyITFT(Li, zi, rii, /'; c„) 

// last row transform 

20 if /' = 1 then CacheFriendlyITFT(L2, C^S 4, "-2, /; ^m) 
// leftmost column trainsforms 

21 for < u < TO do CacheFriendlyITFT(Li, cl;]^^) zi + 1, + 1, 0; c„) 

22 for TO < M < ^2 do CACHEFRiENDLYlTFT(ii, cj]5C; zi,ni + 1, 0; c„) 



only valid if / 1; if / the cell behaves like a ? cell. Cells in bold are those 
about to be transformed by a recursive call. 

Case (a): zi = 0. This implies that < n2 < Z2 = Z2 < L2 
and /' 



TO' — Zi, and /' = 1. Line 17 has no effect since ni = 
Xi = Lihi for ^2 < i < Z2, and destroys Xi for 712 < ^2 < ^2 
has no effect since Z2 = Z2. Line 20 computes Xi = L26 
— x„ — a„ if f ^ 1, and destroys Xi for ni + f < 



ni 
Line 



18 



0, TO = ni, 
computes 
1 < zi < Li. Line [19] 
for < i < ni, computes 
< Li. 



Line 21 computes 
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Figure 6. Before line [Tt] of CacheFriendlyITFT. 
rows are about to be transformed by line |17| 
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Figure 7. After line [17] of CacheFriendlyITFT. 
columns are about to be transformed by lines [T8|jT9| 
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Figure 8. After lines nHHla of CacheFriendlyITFT. The bold 



row is about to be transformed by line 20 
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Figure 9. After hne [20] of CacheFriendlyITFT. The bold 
columns are about to be transformed by lines 21-22 
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Figure 10. After lines EH22 of CacheFriendlyITFT. 



Xi = Lai for < i < n2 — n, and destroys Xi for < *2 < ?^2, 1 < ii < ^i- Line |22| 
has no effect since m = n2- 

Case (b): zi > and n2 = 0. This implies that zi > ni > 0, z'2 — L2, m — 0, 
m! = Z2 and /' — f. Line 17 computes Xi = iv2^i for < i < niL2 = n. Lines 
- Lai for < i < niL2 — n, and if / = 1 also compute Xi = L2bi 

1 , then line 



18 -19 compute Xi 



for Q < 12 < L2, h — ni; they destroy Xi for L2(ni + f) < i < L. If f 



20 computes XniL2 = Xn = an and destroys Xi for niL2 < i < {ni + l)L2. Lines 

21 [22] have no effect since m — n2 — 0- 

Case (c): zi > 0, 712 > and n2 < Z2. This implies that = L2, < ni < Li, 
m = 712, = Z2, and /' = 1. Line 17 computes Xi ~ L2hi for < z < niL2. 
For each 712 < *2 < L2, lines 18 19 compute Xi = Lai for < *i < 't-i, compute 
Xi — L2bi for ii — ni, and destroy Xi for ni < ii < Li. Line [20| computes 
Xi = hi for < i2 < n2, ii = n-i, computes a;„ = a„ if / = 1, and destroys Xi for 
n2 + f < 12 < L2, ii — ni. Finally, for each < ^2 < ^^2, lines [2TH22] compute 
Xi = Lai for < ii < rii + 1 and destroy Xi for ni + 1 < ii < Li. 

Case (d): zi > 0, n2 > and 712 > Z2. The discussion for this case is essentially 
the same as for (c), with m and m! exchanged, and with slightly different diagrams. 

Now we verify the complexity bound. The argument is similar to that used for 
the TFT. For L = 2 the bound is min((n + / - l)/2 + 1, 1) = 1, so the estimate 
holds. Now assume that i > 4, and let — \og2 Li and £2 — log2 L2- 

We first verify that the number of calls to the base case is bounded by L£/2. 
By induction, lines 18 19 and 21 22 call the base case at most L2{Li£i/2) times 
altogether. Lines 17 and 20 call it at most Li(L2^2/2) times (note that if line [20] is 
executed then ni < Li - 1). The sum is -^1^2(^1 + ^2)/2 = Li/2. 

Second, we must verify that the number of calls is bounded by {n + f — l)£/2 + 
L — 1. Line 17 calls the base case at most ni(L2^2/2) times, lines 18]- 19 call 
it at most (L2 — n2){{ni + f — l)^i/2 + Li — 1) times, line 20 calls it at most 
/'((n2 + /-lK2/2- 



times. The sum of these terms is 



L2 — 1) times, and lines 21 22 call it at most n2(ni£i/2+Li — 1) 



Y, where 

X = niL2£2 + L2{ni + f - l)£i - (/' - 1)^2^1 + f{n2 + f- 1)^2 
= (n - n2)e + (/' - 1)(L2 - n2)£i + f'{n2 + f- 1)4 
^{n+f-l)£+ (/' - 1)(L2 - n2)£i + K + / - 1)(£2/' - £), 

Y = L2(ii - 1) + /'(i2 - 1) = i - 1 + (/' - 1)(L2 - 1). 

If /' = 1 then 712 + / > 1 and the bound follows since ^2/' - ^ = -^1 < 0. If /' = 
then 712 = / = and the bound follows since —^2-^1 + ^ < (as in the proof of 
Theorem [T]) . □ 
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Figure 11. Performance of several implementations of the 
Schonhage-Strassen algorithm for 8000-bit coefficients 



5. Empirical performance and applications 

5.1. The Schonhage-Strassen algorithm. Both the Magma computer algebra 
system (version 2.14-15, |BCP97] ) and Victor Shoup's NTL library (version 5.4.2, 
|Sho07j ) use the Schonhage-Strassen algorithm |SS71] for multiplication of dense 
polynomials in Z[x] when (roughly speaking) the coefficient size of the input poly- 
nomials (in bits) is larger than their degree. The algorithm may be sketched as 
follows. Suppose that f,g € Z[x], and put h = fg. Let R = Z/(2'=^/2 + 1)Z, where 
we choose N = 2" > deg h and kN/2 larger than the size of the coefficients of h. 
Multiply the polynomials in R[x]/{x^ — 1), using an FFT with respect to the prin- 
cipal iV-th root of unity = 2*'' G _R, and lift the result back to Z[x]. Arithmetic 
in R is especially efficient owing to the ease of reduction modulo 2^^!"^ + 1 and of 
multiplication by powers of wjv- 

The author, in joint work with William Hart, implemented the Schonhage- 
Strassen algorithm using the techniques of this paper to improve smoothness and 
locality. The implementation is part of the f mpz_poly module in the FLINT li- 
brary (version 1.0.13, |HH08] 1 . which is used as the default back-end for arithmetic 
in Z[a;] in the Sage computer algebra system (version 3.1.1, |SJ05] ). 

The following performance measurements were conducted on a 16-core 2.6GHz 
Opteron server running Ubuntu Linux. This is a 64-bit processor with a 64 KB 
LI cache and 1 MB L2 cache. Only a single core was used for the tests. Our own 
code and NTL were compiled with gcc 4.1.3, and linked with GMP (GNU Multiple 
Precision Arithmetic Library, |Gra08] ) version 4.2.3. We also applied an assembly 
patch of Pierrick Gaudry that improves the performance of GMP on the Opteron. 
Magma also uses Gaudry's patch, and links statically against GMP. 
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Figure [TT] compares four implementations for the case of polynomials with ran- 
dom non-negative 8000-bit coefficients, with lengths ranging from 512 to 16384 in 
5% increments. The graphs for Magma and NTL exhibit the jumps characteristic of 
FFT-based multiplication algorithms. The two graphs for FLINT show the multi- 
plication performance obtained for van der Hoeven's divide- and-conquer truncated 
transforms, and for the cache-friendly truncated transforms. The latter is between 
15% and 35% faster than the former for this range of polynomial lengths, and 
the relative improvement in performance increases with the degree. Note that the 
Fourier coefficients are about 16000 bits long 2 KB), so about 32 coefficients fit 
into the LI cache and about 512 coefficients fit into the L2 cache. 



5.2. The Schonhage— Nussbaumer algorithm. The author implemented the 
cache-friendy transforms in the context of the Schonhage-Nussbaumer algorithm 
|Sch771 lNus80] for multiplication in S[x] where S = Z/mZ and where m is an odd 
word-sized modulus. The implementation is part of the zn_poly polynomial arith- 
metic library (version 0.9, |Har08b| ). The code has been used in several number- 
theoretic applications, including computations of zeta functions of hyperelliptic 
curves over prime fields of large characteristic |Har07| . computations of L-functions 
of hyperelliptic curves over Q |KS08| . computing Hilbert class polynomials |Sut08j . 
and an ongoing project with Joe Buhler to extend the verification of Vandiver's 
conjecture and computation of irregular primes and cyclotomic invariants carried 
out in [BCE+01| . 

The basic idea of the Schonhage-Nussbaumer algorithm is to split the input 
polynomials into pieces of length M/4, and then map the problem to a convolution 
in R[z]/{z^ — 1) for R = S[y]/{y^'^^^ + l), where K\M so that R contains a principal 
K-th root of unity (namely y^^^), and where K is large enough to accommodate 
the product. Our implementation performs the FFTs over R using the transforms 
of f|3] and ^|4] ensuring relatively smooth performance as a function of the input 
polynomial length. The pointwise multiplications are handled using a multipoint 
Kronecker substitution method [HarOSaj . switching to Nussbaumer 's algorithm for 
sufficiently large M. (Note that we do not perform an FFT over Z/mZ; such an 
FFT is usually not possible since Z/mZ rarely contains appropriate roots of unity.) 

We compared the performance of the cache-friendly transforms to the divide- 
and-conquer transforms for a range of polynomial lengths (lO'' to 3 x 10^) and 
modulus sizes (5 to 63 bits). We observed a modest improvement in speed of up to 
15%, depending on the polynomial length and modulus. As expected, polynomials 
of higher degree enjoy a greater relative improvement, as locality plays a greater 
role in such multiplications. Somewhat counterintuitively, the modulus size had 
the opposite effect on relative performance. This may be explained by noting that 
the FFTs in our implementation operate on arrays with each element of Z/mZ 
occupying a single machine word, so the total FFT time does not depend on the 
modulus; on the other hand, the pointwise multiplications are faster for smaller 
moduli, as the Kronecker substitution reduces them to smaller integer multiplica- 
tions. The implementation thus spends a smaller proportion of the total time in 
the FFTs when the modulus is larger, leading to a smaller relative improvement 
derived from the cache- friendly transforms. 
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6. The small coefficient case 

In the applications described in fjS] elements of the coefficient ring R occupy 
moderately large blocks of memory. However, FFTs are also commonly applied 
over 'small' coefficients, such as double-precision floating point numbers, or residues 
modulo a word-sized prime p where Z/pZ contains suitable roots of unity. We have 
not attempted an implementation in this context, but in this section we make 
several relevant observations. 

An essential consideration in the small coefficient case is spatial locality, which 
we have largely ignored in this paper. In typical contemporary cache hardware, 
the cache is organised into cache lines, each capable of storing several words from 
consecutive locations in main memory. If an algorithm operates on coefficients 
spaced out in memory, then only a single word of each cache line will be utilised, 
greatly reducing the effective size of the cache. Moreover, the mapping from physical 
addresses to cache lines often depends on only the last few bits of the address. 
If two coefficients are separated by a large power-of-two distance in memory — 
exactly the situation during the column transforms of a matrix FFT — then the 
cache cannot simultaneously hold both of them (although this can be mitigated to 
some extent by cache associativity). The standard solution to these problems is 
to transpose the matrix for the duration of the column transforms, using a cache- 
friendly matrix transpose algorithm, so that the subtransforms always operate on 
consecutive data. A similar approach would be needed to adapt our TFTs/ITFTs 
to the small coefficient case. 

A second remark is that in the small coefficient case, it is quite reasonable to zero- 
pad the inputs so that there is no 'partial row'. The rationale is that the lowest level 
of cache can hold a large number of coefficients, making the penalty for zero-padding 
quite small. For example, suppose that the cache can hold 2^"^ coefficients (typical 
for a 64KB LI cache with double-precision floating-point coefficients), and that we 
are multiplying polynomials whose product has length n = 12801 = 100 -2^-1-1. 
This requires a transform length of 2^^, which we may decompose into a 2'^ X 2^ 
matrix. If we zero-pad the inputs so that n increases to 12928 = 101 -2^, an integral 
number of rows, the running time penalty incurred is at most 1%. This approach 
simplifies the ITFT routine considerably, since it may be implemented by simply 
reversing the steps of the TFT, removing the need for the special row transform 
(line 20 of Algorithm [o]) . The reduction in code complexity is likely worthwhile. 



We also note that the presence of a partial row makes it more difficult to maintain 
spatial locality during the special row transform. 

Finally, in the implementations described in ^Js] the parameter ( = oj" is repre- 
sented simply by the integer s. Wit h th is representation, computing roots of unity 
(for example, computing in line 13 of Algorithm O) is very cheap compared to 



the cost of arithmetic in R. In the small coefficient case this is no longer neces- 
sarily true, and the cost of computing or storing roots of unity must be taken into 
account. 
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